Regularization

Overfitting occurs when a model gives too much weight to the exact training points, effectively fitting a curve that passes through every one of them. To generalize, we need to reduce the model's reliance on those exact points and tolerate some variance. One way to do this is to add a penalty term to the loss function that discourages the model from assigning too much importance to individual features or coefficients. This technique is called regularization.

Regularization is an important technique in machine learning that helps to improve model accuracy by preventing overfitting, which happens when a model learns the training data too well, including its noise and outliers, and then performs poorly on new data. By adding a penalty for complexity, it encourages simpler models that perform better on unseen data.

There are three primary types of regularization penalties: Lasso (L1), Ridge (L2), and Elastic Net (a combination of L1 and L2). The only difference among them is the form of the penalty term added to the loss.

There are several ways of controlling the capacity of machine learning models and neural networks to prevent overfitting:

1. Lasso Regression

L1 regularization is a relatively common form of regularization, where for each weight w we add the term
λ |w| to the objective. The L1 regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e., very close to exactly zero). In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the "noisy" inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, made up of many small values. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1.

A regression model that uses the L1 Regularization technique is called LASSO (Least Absolute Shrinkage and Selection Operator) regression. It adds the absolute value of the magnitude of the coefficient as a penalty term to the loss function (L).
This penalty can shrink some coefficients to zero, which helps in selecting only the important features and ignoring the less important ones.

Cost = (1/n) ∑(i=1 to n) (yᵢ − ŷᵢ)² + λ ∑(i=1 to m) |wᵢ|

Where:

- n is the number of training samples, yᵢ the actual value and ŷᵢ the predicted value for sample i
- m is the number of features and wᵢ the coefficient (weight) of feature i
- λ is the regularization strength, controlling how heavily the penalty is weighted
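To make the sparsity-inducing behavior concrete, here is a minimal NumPy sketch of Lasso fitted by proximal gradient descent (ISTA) on synthetic data. The data, the `lam` value, and the helper names are illustrative assumptions, not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features matter; the other eight are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

def soft_threshold(z, t):
    """Proximal operator of t*|w|: shrink toward zero, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    n = len(y)
    w = np.zeros(X.shape[1])
    step = n / np.linalg.norm(X, 2) ** 2  # 1/L for the (1/n) squared-error term
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n      # gradient of the data-fit term only
        w = soft_threshold(w - step * grad, step * lam)
    return w

w = lasso_ista(X, y, lam=0.5)
print(w)  # most or all of the noise-feature weights end up exactly zero
```

The soft-thresholding step is what produces exact zeros: any coordinate whose update falls below the threshold is clipped to 0, which is the feature-selection behavior described above.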

2. Ridge Regression

L2 regularization is the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight w in the network, we add the term
½ λw² to the objective, where λ is the regularization strength. It is common to see the factor of ½ in front because then the gradient of this term with respect to the parameter w is simply λw instead of 2λw. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. As we discussed in the Linear Classification section, due to multiplicative interactions between weights and inputs this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot. Lastly, notice that during gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly: w += -lambda * w towards zero.

A regression model that uses the L2 regularization technique is called Ridge regression. It adds the squared magnitude of the coefficient as a penalty term to the loss function (L).

Cost = (1/n) ∑(i=1 to n) (yᵢ − ŷᵢ)² + λ ∑(i=1 to m) (wᵢ²)

Where:

- n is the number of training samples, yᵢ the actual value and ŷᵢ the predicted value for sample i
- m is the number of features and wᵢ the coefficient (weight) of feature i
- λ is the regularization strength; larger values shrink the coefficients more
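To see the shrink-but-rarely-to-zero behavior, here is a hedged NumPy sketch using the closed-form ridge solution on synthetic data (the data and λ values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y = X @ w_true + rng.normal(scale=0.1, size=50)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (XᵀX + λI)⁻¹ Xᵀy."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w_small = ridge_fit(X, y, lam=0.1)
w_large = ridge_fit(X, y, lam=100.0)
# A larger λ shrinks all coefficients toward zero, but not exactly to zero.
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```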

(Animation: as the regularization strength λ increases from 0 to 10, the penalty surface becomes increasingly extreme.)

3. Elastic Net Regression

It is possible to combine the L1 regularization with the L2 regularization:
λ₁ |w| + λ₂w² (this is called Elastic net regularization).

Elastic Net regression combines the L1 and L2 penalties: we add both the absolute values of the weights and their squares to the loss, with an extra hyperparameter α that controls the ratio between the two:

Cost = (1/n) ∑(i=1 to n) (yᵢ − ŷᵢ)² + λ [ (1 − α) ∑(i=1 to m) |wᵢ| + α ∑(i=1 to m) (wᵢ²) ]

Where:

- n, yᵢ, ŷᵢ, m, wᵢ and λ are as before (samples, actual values, predictions, features, coefficients and overall regularization strength)
- α ∈ [0, 1] controls the mix of penalties: α = 0 recovers pure Lasso (L1) and α = 1 recovers pure Ridge (L2)

Ridge regression (solid lines) has an L2 penalty and shrinks coefficients toward 0, but very rarely all the way to 0. Lasso regression (dashed lines) has an L1 penalty and can shrink coefficients all the way to 0, yielding sparser solutions.

4. Max norm constraints

Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector w of every neuron to satisfy ‖w‖₂ < c. Typical values of c are on the order of 3 or 4. Some people report improvements when using this form of regularization. One of its appealing properties is that the network cannot “explode” even when the learning rates are set too high because the updates are always bounded.
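The projection step might be sketched as follows in NumPy, assuming weights are stored with one row per neuron (the function name and layout are illustrative assumptions):

```python
import numpy as np

def max_norm_project(W, c=3.0):
    """Clamp each neuron's incoming weight vector (a row of W) to ‖w‖₂ ≤ c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[6.0, 8.0],   # norm 10 -> rescaled down to norm 3
              [0.3, 0.4]])  # norm 0.5 -> left unchanged
W = max_norm_project(W, c=3.0)
print(np.linalg.norm(W, axis=1))  # ≈ [3.0, 0.5]
```

In a training loop this projection would be applied after each ordinary gradient update, which is exactly the projected gradient descent described above.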

5. Dropout

Dropout is an extremely effective and simple regularization technique, introduced by Srivastava et al., that complements the other methods (L1, L2, max norm). During training, dropout is implemented by keeping each neuron active with some probability p (a hyperparameter) and setting it to zero otherwise.
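A common implementation is the "inverted" dropout variant, which scales the surviving activations by 1/p at training time so that no rescaling is needed at test time. A minimal NumPy sketch (the function name and values are illustrative assumptions):

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    """Inverted dropout: keep each unit with probability p during training,
    scaling survivors by 1/p so the expected activation matches test time."""
    if not train:
        return x  # no-op at test time
    mask = (np.random.rand(*x.shape) < p) / p
    return x * mask

np.random.seed(0)
x = np.ones(10000)
out = dropout_forward(x, p=0.5)
# Roughly half the units are zeroed; the rest are scaled up to 2.0,
# so the mean activation stays close to 1.
print(out.mean())
```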
